Introduction

We use the R package CoordinateCleaner to flag potentially erroneous, suspect, or imprecise geographic coordinates based on geographic gazetteers and metadata. It includes a series of tests that identify records assigned to country capitals, to country and province centroids, to urban areas, and to the vicinity of biodiversity institutions or the GBIF headquarters. The module also contains tests to flag coordinates below a given precision (e.g., 100 km), zero or equal coordinates, and duplicated records (i.e., records with identical taxon name and coordinates).

Note that we do not use the “seas” test to remove records in the ocean because such records were already removed in the pre-filter module of the package (more details here).


Important:

The results of each data quality test are appended to the database as separate columns with values TRUE or FALSE: TRUE indicates a record that passed the test, and FALSE indicates a potentially problematic or suspect record.

Installation

You can install the released version of bdc from GitHub with:

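if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")

library(bdc)      # attach bdc; require() does not attach a freshly installed package
library(magrittr) # provides the %>% pipe used below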

Creating directories

Create directories to save the results of the functions. If they do not already exist, four new folders are created inside a folder named ‘Output’.
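
The directory step can be sketched in base R as follows. This is only an illustration, not the package's own helper: apart from "Output" and "Intermediate", which appear in the paths used in this document, the folder names are assumptions, and the sketch writes to a temporary location so that it has no side effects on a real project.

```r
# Sketch only: "Check", "Report", and "Figures" are assumed folder names.
# Replace tempdir() with the project root to create the folders for real.
output_root <- file.path(tempdir(), "Output")
subfolders <- c("Check", "Intermediate", "Report", "Figures")

for (folder in file.path(output_root, subfolders)) {
  # Create each folder (and "Output" itself) only if it is missing
  if (!dir.exists(folder)) dir.create(folder, recursive = TRUE)
}

# Confirm which folders now exist
dir.exists(file.path(output_root, subfolders))
```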

Reading the database

Read the database created in the taxonomy module of the bdc package. It is also possible to read any dataset containing the fields required to run the function (more details here).

database <-
  qs::qread(here::here("Output/Intermediate/02_taxonomy_database.qs"))

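Standardize the character encoding of all text columns.

# Declare UTF-8 encoding for every character column.
# `[[i]]` extracts the column as a vector; `[, i]` would return a
# one-column tibble when the database is a tibble, so the check would fail.
for (i in seq_len(ncol(database))) {
  if (is.character(database[[i]])) {
    Encoding(database[[i]]) <- "UTF-8"
  }
}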



Flagging common spatial issues

check_space <-
  CoordinateCleaner::clean_coordinates(
    x =  database,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    species = "scientificName",
    countries = NULL,
    tests = c(
      "capitals",     # records within 2 km of country capitals
      "centroids",    # records within 1 km of country and province centroids
      "duplicates",   # duplicated records
      "equal",        # records with equal coordinates
      "gbif",         # records within 1 degree (~111km) of GBIF headquarters
      "institutions", # records within 100 m of zoos and herbaria
      "outliers",     # outliers
      "zeros",        # records with coordinates 0,0
      "urban"         # records within urban areas
    ),
    capitals_rad = 2000,
    centroids_rad = 1000,
    centroids_detail = "both", # test both country and province centroids
    inst_rad = 100, # flag records within 100 m of zoos and herbaria
    outliers_method = "quantile",
    outliers_mtp = 5,
    outliers_td = 1000,
    outliers_size = 10,
    range_rad = 0,
    zeros_rad = 0.5,
    capitals_ref = NULL,
    centroids_ref = NULL,
    country_ref = NULL,
    country_refcol = "countryCode",
    inst_ref = NULL,
    range_ref = NULL,
    # seas_ref = continent_border,
    # seas_scale = 110,
    urban_ref = NULL,
    value = "spatialvalid" # results of tests are appended in separate columns
  )
#> Testing coordinate validity
#> Flagged 0 records.
#> Testing equal lat/lon
#> Flagged 0 records.
#> Testing zero coordinates
#> Flagged 1 records.
#> Testing country capitals
#> Flagged 10 records.
#> Testing country centroids
#> Flagged 10 records.
#> Testing urban areas
#> Downloading urban areas via rnaturalearth
#> OGR data source with driver: ESRI Shapefile 
#> Source: "C:\Users\Bruno Ribeiro\AppData\Local\Temp\RtmpiUvCvx", layer: "ne_50m_urban_areas"
#> with 2143 features
#> It has 4 fields
#> Integer64 fields read as strings:  scalerank
#> Flagged 279 records.
#> Testing geographic outliers
#> Flagged 10 records.
#> Testing GBIF headquarters, flagging records around Copenhagen
#> Flagged 0 records.
#> Testing biodiversity institutions
#> Flagged 11 records.
#> Testing duplicates
#> Flagged 97 records.
#> Flagged 389 of 6112 records, EQ = 0.06.
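
To read the output above, it can help to count the flagged records per test column. The sketch below uses a small made-up data frame in place of check_space; the column names mimic the dot-prefixed test columns, and the values are invented for illustration. It also reproduces the last line of the log, assuming that EQ is the fraction of flagged records.

```r
# Toy stand-in for the test columns of check_space (values are invented)
flags <- data.frame(
  .zer = c(TRUE, TRUE, FALSE, TRUE),
  .cap = c(TRUE, FALSE, TRUE, TRUE),
  .urb = c(TRUE, FALSE, FALSE, TRUE)
)

# FALSE marks a flagged record, so count the FALSE values in each column
flagged_per_test <- colSums(!flags)
flagged_per_test

# Assumption: EQ in the log is flagged records divided by total records
eq <- round(389 / 6112, 2)
eq
```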

Flagging coordinates with low decimal precision

Here we identify records with a coordinate precision below a specified number of decimal places. For example, the precision of a coordinate with 1 decimal place is 11.132 km at the equator, i.e., the scale of a large city.

check_space <-
  bdc_coordinates_precision(
    data = check_space,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    ndec = c(0, 1) # number of decimals to be tested
  )
#> bdc_coordinates_precision:
#> Flagged 50 records
#> One column was added to the database.
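
The 11.132 km figure quoted above follows from the length of one degree at the equator. The short calculation below is a sketch of that arithmetic; precision_km is a hypothetical helper written for this example and is not part of bdc.

```r
# Length of one degree along the equator, from the equatorial circumference
km_per_degree <- 40075.017 / 360

# Hypothetical helper: distance spanned by one unit of the last decimal
# place of a coordinate at the equator, for a given number of decimals
precision_km <- function(ndec) {
  km_per_degree * 10^(-ndec)
}

# Roughly 111.3, 11.1, and 1.1 km for 0, 1, and 2 decimal places
precision_km(0:2)
```

With ndec = c(0, 1), the coordinates tested are therefore those resolved only to about 111 km or about 11 km.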

Mapping spatial errors

It is possible to map the results of one spatial test at a time. For example, let’s map records located at the centroids of countries or provinces.

check_space %>%
  dplyr::filter(.cen == FALSE) %>%
  bdc_quickmap(
    data = .,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    col_to_map = ".cen",
    size = 0.7
  )


Coordinates in country and province centroids

Report

Here we create a column named .summary that summarizes the results of all tests. This column is FALSE if a record was flagged as “FALSE” by any test (i.e., it is a potentially invalid or suspect record).

check_space <- bdc_summary_col(data = check_space)
#> Column '.summary' already exist. It will be updated
#> 
#> bdc_summary_col:
#> Flagged 466 records.
#> One column was added to the database.
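
The logic of the summary column can be illustrated with base R. This is a minimal sketch on invented data, not the implementation used by bdc_summary_col: a record is TRUE in the summary only if it is TRUE in every dot-prefixed test column.

```r
# Toy stand-in for the test columns (values are invented)
flags <- data.frame(
  .cen = c(TRUE, FALSE, TRUE),
  .urb = c(TRUE, TRUE, FALSE),
  .zer = c(TRUE, TRUE, TRUE)
)

# Select the test columns, i.e., those whose names start with "."
test_cols <- grep("^\\.", names(flags), value = TRUE)

# Combine the columns element-wise with logical AND:
# a single FALSE in any test makes the summary FALSE
flags$.summary <- Reduce(`&`, flags[test_cols])
flags
```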



Create a report summarizing the results of all tests.

report <-
  bdc_create_report(data = check_space,
                    database_id = "database_id",
                    workflow_step = "space")

report


Figures

Here we create figures (bar plots and maps) to make the interpretation of the results of data quality tests easier.

bdc_create_figures(data = check_space,
                   database_id = "database_id",
                   workflow_step = "space")


Rounded coordinates (% of each database flagged)


Records within biodiversity institutions


Summary of all tests


Filtering the database

It is possible to remove flagged records (potentially problematic ones) to get a ‘clean’ database (i.e., without test columns starting with “.”). However, to ensure that all records are evaluated in all data quality tests (i.e., the tests of the taxonomic, spatial, and temporal modules of the package), potentially erroneous or suspect records are removed only in the final module of the package.

# output <-
#   check_space %>%
#   dplyr::filter(.summary == TRUE) %>%
#   bdc_filter_out_flags(data = ., col_to_remove = "all")
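
For readers who want to see what this filtering amounts to, the sketch below performs the same two steps in base R on invented data: keep the records that passed all tests, then drop the dot-prefixed test columns. It is an illustration of the idea, not the code behind bdc_filter_out_flags.

```r
# Toy records with two test columns (values are invented)
records <- data.frame(
  scientificName = c("Species a", "Species b", "Species c"),
  .cen = c(TRUE, FALSE, TRUE),
  .summary = c(TRUE, FALSE, TRUE)
)

# Keep rows that passed every test and drop columns starting with "."
keep_rows <- records$.summary
keep_cols <- !startsWith(names(records), ".")
clean <- records[keep_rows, keep_cols, drop = FALSE]
clean
```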

Saving the database

check_space %>%
  qs::qsave(.,
            here::here("Output", "Intermediate", "03_space_database.qs"))